All Life Bank - Loan Modelling - Supervised Learning - Classification

Context

All Life Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at All Life Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

The classification goal is to predict the likelihood of a liability customer buying a personal loan. In other words, we have to build a model that predicts which customers are most likely to accept a personal loan offer, based on their specific relationship with the bank across the features given in the dataset.

Key questions to be answered

  1. Can we predict whether a liability customer will buy a personal loan?
  2. Which variables are most significant?
  3. Which segment of customers should be targeted more?

Data Information

The data contains the important demographic and banking details of the customers.

Attribute Information:

  1. ID: Customer ID
  2. Age: Customer’s age in completed years
  3. Experience: #years of professional experience
  4. Income: Annual income of the customer (in thousand dollars)
  5. ZIP Code: Home Address ZIP code.
  6. Family: the Family size of the customer
  7. CCAvg: Average spending on credit cards per month (in thousand dollars)
  8. Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  9. Mortgage: Value of house mortgage if any. (in thousand dollars)
  10. Personal_Loan: Did this customer accept the personal loan offered in the last campaign?
  11. Securities_Account: Does the customer have securities account with the bank?
  12. CD_Account: Does the customer have a certificate of deposit (CD) account with the bank?
  13. Online: Do customers use internet banking facilities?
  14. CreditCard: Does the customer use a credit card issued by any other bank (excluding All Life Bank)?

Import required libraries

Define all required functions

Load the dataset in dataframe from the file
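A minimal sketch of the loading step. The filename is an assumption (adjust to the file provided with the case study); the inline sample below just stands in for the real CSV so the snippet is self-contained:

```python
import io
import pandas as pd

# The real notebook would read the provided file; the name is hypothetical:
# df = pd.read_csv("Loan_Modelling.csv")

# Self-contained stand-in: two sample rows with the documented columns.
sample_csv = io.StringIO(
    "ID,Age,Experience,Income,ZIPCode,Family,CCAvg,Education,Mortgage,"
    "Personal_Loan,Securities_Account,CD_Account,Online,CreditCard\n"
    "1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0\n"
    "2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)  # (2, 14)
```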

Understand data

Check the shape of data

List the columns

Check the datatypes of columns
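The shape, column, and datatype checks above can be sketched as follows (a stand-in dataframe is used for illustration):

```python
import pandas as pd

# Stand-in for the loaded dataframe; the real one has 14 numeric columns.
df = pd.DataFrame({"Age": [25, 45], "CCAvg": [1.6, 1.5]})

print(df.shape)              # (rows, columns)
print(df.columns.tolist())   # list the columns
print(df.dtypes)             # int64 / float64 expected throughout
# df.info() gives dtypes plus non-null counts in a single call
```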

Observations:

  1. All columns are of numeric type (int64 / float64)
  2. There are no columns with null values

Check the missing data

Observations:

There are no missing values

Check the duplicate records
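The missing-value and duplicate checks can be sketched as (stand-in data; in the notebook these calls run on the loaded dataframe):

```python
import pandas as pd

# Stand-in data with one fully duplicated row and no missing values.
df = pd.DataFrame({"Age": [25, 45, 45], "Income": [49, 34, 34]})

missing_per_column = df.isnull().sum()   # count of NaNs per column
duplicate_rows = df.duplicated().sum()   # count of fully duplicated rows
print(missing_per_column.sum(), duplicate_rows)  # 0 1
```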

Observations:

There are no duplicate values

Check the sample data

View the first and last 5 rows of the dataset.

Observations:

  1. The ID column is a running serial number; it adds no value to model building.
  2. ZIP Code is stored as numeric data, but it is actually categorical. We will process and convert it using the uszipcode library.
  3. Personal_Loan is our target variable.
  4. Securities_Account, CD_Account, Online, CreditCard and Personal_Loan are binary columns depicting whether or not the customer uses/used these facilities.
  5. Age, Experience, Income, CCAvg and Mortgage are continuous features of the dataset.

Summary of dataset

Observations:

  1. The average age of a customer is 45 and the minimum & maximum are 23 and 67 respectively.
  2. 75% of the customers have a family size of 3 or less.
  3. The average annual income of a customer is approximately $73,000 (Income is in thousands of dollars).
  4. The minimum experience of a customer is -3, which indicates the presence of erroneous data because experience cannot be negative. Such values require handling.
  5. All other data looks OK.

Feature Re-engineering

ZIPCode

Check the distinct values in the column

Based on the uszipcode library documentation, we can find the County and State from the ZIP code. Let's apply the defined functions and add those values to the dataset for further analysis.

Check the data after update

Observations:

County and State columns are now added to the dataframe

Check the distinct values of state column

Observations:

  1. 4966 rows have the State updated to CA (California)
  2. 34 rows have no value; it seems the library did not return any values, and we have to try to impute them

Let's check the null values of the State column

Observations:

  1. All 34 rows with a missing State have a NULL County as well.
  2. The earlier assumption that the library did not return values for State and County was correct.
  3. Based on a Google search, below are the values to be updated:
    • Ventura - CA, 93077
    • Irvine - CA, 92717
    • Fullerton - CA, 92634
    • San Francisco - CA, 96651
  4. The missing data is also for ZIP codes starting with 9, which means all the data in this dataset is for the state of California only.
  5. Since all the data is from one state, the State column carries no information and we can exclude it from the analysis.
  6. However, County can be valuable information, and we can use it for data analysis once the values are updated as per point 3.
  7. Other attributes of the ZIP code, such as city and locality, would add a lot of variables if included, hence we will just use the County column to look for patterns based on where customers live.

Update the County and State columns
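A possible sketch of the update, using the hand-collected ZIP-to-place list above (the stand-in dataframe and the fallback mapping are illustrative, not the notebook's exact code):

```python
import numpy as np
import pandas as pd

# Stand-in frame: two rows the uszipcode lookup resolved, two it did not.
df = pd.DataFrame({
    "ZIPCode": [91107, 93077, 92717, 90089],
    "County": ["Los Angeles County", np.nan, np.nan, "Los Angeles County"],
    "State": ["CA", np.nan, np.nan, "CA"],
})

# Hand-collected fixes from the Google-search list above.
manual_fix = {93077: "Ventura", 92717: "Irvine", 92634: "Fullerton", 96651: "San Francisco"}

df["County"] = df["County"].fillna(df["ZIPCode"].map(manual_fix))
df["State"] = df["State"].fillna("CA")  # every unresolved ZIP turned out to be Californian
print(df[["County", "State"]].isnull().sum().sum())  # 0
```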

Verify the update

Observations:

  1. State and County Columns are updated for missing values.
  2. ZIPCode data translation is complete and we can use the new columns for our analysis.

Data Preprocessing

Experience

We have seen that experience has some negative values and that needs to be corrected.

Check the number of records with Negative Experience.

Observations:

  1. There are 52 rows with negative experience
  2. 33 Records have -1, 15 have -2 and 4 have -3 as experience
  3. We need to update these rows with related information from the dataset

Check the data for the rows with Negative Experience

Observations:

  1. All customers with negative experience have income of more than USD 10,000 per annum.
  2. A few customers earn more than USD 100,000 per annum.
  3. All the customers are above 20 years of age and fall in the earning category.
  4. All the customers have at least an undergraduate degree.
  5. Some customers also have mortgages amounting to more than USD 80,000.
  6. It looks like there has been an error in data capture.

Check the correlation of Experience with other variables

Observations:

  1. Experience is highly correlated with Age.
  2. Education is positively correlated with Age.
  3. Both Education and Age impact Experience, so we can use rows with positive Experience values in these columns to impute the rows where Experience is negative.
  4. From domain knowledge, we also know that Income and Mortgage can play an important role in the imputation.
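One way to implement this imputation is to replace each negative Experience value with the median Experience of non-negative peers sharing the same Age and Education (a sketch with stand-in data; the notebook's exact strategy may differ):

```python
import pandas as pd

# Stand-in data: one row with an erroneous negative Experience value.
df = pd.DataFrame({
    "Age": [25, 25, 25, 40],
    "Education": [1, 1, 1, 2],
    "Experience": [2, 3, -1, 15],
})

# Median experience of customers with non-negative experience,
# grouped by Age and Education.
ref = (
    df[df["Experience"] >= 0]
    .groupby(["Age", "Education"])["Experience"]
    .median()
)

def impute_experience(row):
    if row["Experience"] >= 0:
        return row["Experience"]
    key = (row["Age"], row["Education"])
    # fall back to 0 when no non-negative peer group exists
    return ref.loc[key] if key in ref.index else 0

df["Experience"] = df.apply(impute_experience, axis=1)
print((df["Experience"] < 0).sum())  # 0
```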

Validate the update

Observations:

There are no records with negative experience

Check the dataset info after data pre processing

Observations:

  1. We see the experience data is corrected and minimum experience is now "0" instead of earlier "-3".
  2. Pre-processing is complete and we can proceed with analysis.

Univariate Analysis

ID

Observations:

  1. The ID column shows a uniform distribution.
  2. The ID column identifies the customer record in the dataset but will not add any value to the analysis.
  3. We can exclude ID from the predictor variables.

Age

Observations:

  1. The Age column shows a normal distribution.
  2. The mean is slightly greater than the median.
  3. The data can be used for further analysis.

Experience

Observations:

  1. The Experience column shows a normal distribution.
  2. There is hardly any difference between the mean and the median.
  3. The data can be used for further analysis.

Income

Observations:

  1. The Income column shows positive skewness.
  2. A few customers earn more than USD 200,000 per annum.
  3. Income has outliers; we will try to scale the variable and see if that reduces the skewness.

ZIPCode

Observations:

  1. ZIPCode is a categorical variable, and we have derived the State and County from it.
  2. From the graph, we can see that a few ZIP codes have a higher customer density.
  3. We can exclude ZIPCode from the predictor variables and use County instead to study the distribution by group.

State

Observations:

  1. Since the dataset is for the state of California only, the graph shows 100% for CA.
  2. The State column can be dropped from the predictor variables as it would not add any value to the analysis.

County

Observations:

  1. The most customers are from Los Angeles County, followed by San Diego and Santa Clara.
  2. This suggests the bank's customer base is concentrated in Los Angeles County.

Family

Observations:

  1. Assuming that customers with 1 in the Family column are single, they contribute 29.4% of the data.
  2. They are followed by couples at 25.9%.
  3. Customers with a family of 3 contribute only 20.2% of the dataset.
  4. This suggests single customers form the largest segment of the bank's customer base.

CCAvg

Observations:

  1. The CCAvg column shows positive skewness.
  2. The mean is greater than the median.
  3. Most customers' average monthly spending on credit cards is between USD 0 and USD 2,500. Very few customers spend more than USD 8,000 per month on average.
  4. CCAvg shows outliers, and we will try to treat them by scaling the variable.

Education

Observations:

  1. Most customers are undergraduates, contributing almost 42% of the dataset.
  2. They are followed by the Graduate and Advanced/Professional education groups.

Mortgage

Observations:

  1. Most of the customers do not have any mortgage.
  2. Among customers with a mortgage, amounts between USD 80,000 and USD 150,000 are the most common.
  3. Very few customers have a mortgage of more than USD 600,000.
  4. Mortgage shows heavy outliers, and we will try to scale the variable during outlier treatment to check the impact.

Personal Loan - Target Variable

Observations:

  1. Only 9.6% of customers have availed the Personal_Loan facility in the past.
  2. The remaining 90.4% highlight that the dataset is imbalanced towards customers not availing a personal loan.
  3. This means our model will tend to perform better on the customers who would not accept the personal loan offer.

Securities_Account

Observations:

  1. Only 10.4% of customers have a Securities Account with the bank.
  2. 89.6% of customers are not using this facility.
  3. We can recommend that the business market Securities Accounts to increase the customer base.

CD - Certificate of Deposit

Observations:

  1. Only 6% of customers have a Certificate of Deposit with the bank.
  2. 94% of customers are not using this facility.
  3. We can recommend that the business market CD Accounts to increase the customer base.

Online

Observations:

  1. About 60% of customers use the online banking facility provided by the bank.
  2. Nearly 40% of customers are still not using online banking.
  3. We can recommend that the business promote and motivate customers to start using online banking; this will help reduce branch dependency.

Credit Card

Observations:

  1. Only 23.4% of customers are using credit cards.
  2. 76.6% of customers are not using this facility.
  3. We can recommend that the business market credit card products to increase the customer base.

Bivariate Analysis

Personal_Loan VS CreditCard

Observations:

There is only a marginal difference in the percentage of credit card holders applying for a personal loan, compared to customers without a credit card.

Personal_Loan VS Age

Observations:

  1. Customer age has very little impact on his/her interest in availing a personal loan.
  2. Customers aged 65 are the ones who have availed the personal loan the most.

Personal_Loan VS Experience

Observations:

  1. Customer experience has very little impact on his/her interest in availing a personal loan.
  2. Customers with 41 years of experience are the ones who have availed the personal loan the most.

Personal_Loan VS County

Observations:

  1. Customers from Sonoma County have the highest percentage of users applying for a personal loan.
  2. Shasta County is at number 2.

Personal_Loan VS Family

Observations:

Customers with a family of 3 contribute a slightly higher percentage compared to others.

Personal_Loan VS Education

Observations:

  1. Customers with higher education contribute a larger percentage of people applying for a personal loan.
  2. Only a small portion of Graduate customers apply for a personal loan.

Personal_Loan VS Securities_Account

Observations:

Having a Securities Account has only a slight impact on the probability of a customer applying for a personal loan.

Personal_Loan VS CD_Account

Observations:

Customers with CD Accounts have shown more interest in personal loans.

Personal_Loan VS Online

Observations:

Whether a customer uses the online banking feature has very little impact on personal loan interest.

Multi Variate Analysis

Correlation Heat Map

Observations:

  1. The highest correlation of almost 1 is between Age and Experience. We can try building models with and without experience.
  2. CCAvg and Income have a moderate correlation of value 0.65.
  3. The attributes Mortgage & Income and Securities Account & CD Account are moderately correlated.
  4. Income influences CCAvg, Mortgage, CD Account and Personal Loan.
  5. There is also evident association of CD Account with Income, CCAvg, Securities Account, Online and Credit Card.
  6. The target attribute i.e Personal Loan has maximum correlation with Income, CCAvg, Education, Mortgage and CD Account.

Education and Income effect on Personal_Loan

Observations:

  1. Customers who took a loan have a similar range of income irrespective of education level.
  2. Customers with an education level of Graduate or Advanced/Professional have higher chances of taking a loan.

Education and Mortgage effect on Personal_Loan

Observation:

Customers of any education level with a high house mortgage value have higher chances of taking a loan than those of the same education level with a low mortgage value.

CreditCard and CCAvg effect on Personal_Loan

Observation:

Customers who have a credit card and a higher monthly average spend are more likely to avail a personal loan.

Age and Experience effect on Personal_Loan

Observations:

  1. As we already know, Age and Experience are highly correlated, with a correlation of nearly 1.
  2. We can drop the Experience column from the analysis. However, for comparison, we can build two models: with and without Experience.

Outliers Treatment

We have seen during univariate analysis that Mortgage, Income and CCAvg have outliers. We will try to treat the outliers using the functions defined and see if that reduces the skewness.

We tried building the model without treating the outliers; performance is better when the outliers are treated.
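One common treatment, sketched below, caps values at the 1.5 * IQR whiskers. Whether the notebook capped, scaled, or dropped the outliers is not shown, so treat the exact method as an assumption:

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series) -> pd.Series:
    """Clip values outside the 1.5 * IQR whiskers (one common treatment)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

income = pd.Series([40, 45, 50, 55, 60, 400])  # stand-in values with one outlier
capped = cap_outliers_iqr(income)
print(capped.max())  # the 400 is pulled down to the upper whisker
```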

Model Building

Logistic Regression

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting that Customer would be interested in Personal Loan but in reality Customer rejects the offer for the Personal Loan.
  2. Predicting that Customer would not be interested in Personal Loan but in reality Customer accepts the offer for the Personal Loan.

Which case is more important?

How do we reduce this loss, i.e. reduce false negatives?

Define the Dependent and Independent Variables

Create dummy variables

Split the Data in training and testing sets with (70:30) ratio
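The dummy-encoding and 70:30 split steps can be sketched as follows (stand-in data; the `stratify` argument, which preserves the ~9.6% positive rate in both splits, is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame; the real dataframe has the columns described above.
df = pd.DataFrame({
    "Income": range(100),
    "County": ["Los Angeles County", "San Diego County"] * 50,
    "Personal_Loan": [0] * 90 + [1] * 10,
})

X = pd.get_dummies(df.drop(columns="Personal_Loan"), drop_first=True)
y = df["Personal_Loan"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape[0], X_test.shape[0])  # 70 30
```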

Check the split of Data Set

Chect the split of Personal_Loan

Fit the Logistic Regression Model
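A minimal sketch of the fit using scikit-learn's LogisticRegression on synthetic stand-in data (the solver and settings are library defaults, not necessarily those used in the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # stand-in features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in target

model = LogisticRegression()
model.fit(X, y)
print(model.coef_)        # one log-odds coefficient per feature
print(model.score(X, y))  # training accuracy
```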

Find the coefficients of the Model

Observations:

  1. The coefficients of Age, Securities_Account, CreditCard and some counties are negative; an increase in these decreases the chances of a customer applying for a personal loan.
  2. The coefficients of Experience, Income, Family, CCAvg, Education, Mortgage, CD_Account and some counties are positive; an increase in these increases the chances of a customer applying for a personal loan.

Convert Coefficients to the Odds

  1. The coefficients of the logistic regression model are in terms of log(odds); to find the odds we take the exponential of the coefficients.
  2. Therefore, odds = exp(b)
  3. The percentage change in odds is given as (exp(b) - 1) * 100
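The conversion can be sketched as follows (the coefficient values are hypothetical stand-ins for the fitted model's coefficients):

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients from a fitted logistic regression.
coefs = pd.Series({"Age": -0.05, "Income": 0.04})

odds = np.exp(coefs)                    # odds = exp(b)
pct_change = (np.exp(coefs) - 1) * 100  # percentage change in odds per unit increase
print(pct_change.round(2))
```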

Observations:

  1. A unit increase in Age decreases the odds of a customer applying for a personal loan.
  2. A unit increase in Experience, Income, Family, CCAvg, Education or Mortgage increases the odds of a customer applying for a personal loan.
  3. Holding a Securities_Account or a CreditCard decreases the odds of a customer applying for a personal loan.
  4. Holding a CD_Account increases the odds of a customer applying for a personal loan.
  5. Living in certain counties increases the chance of a personal loan application.

Check the performance of model on Training Set

Check different scores of the model

Check the performance of the model on Testing Set

Check the different scores of the model

ROC-AUC

ROC-AUC on Training Set

ROC-AUC on Testing Set

Observations:

The model gives a generalized performance on both the training and testing sets.

Model Performance Improvement

Try to improve F1 Score, by changing the model threshold using AUC-ROC Curve.

Optimal threshold using AUC-ROC curve
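One common way to pick a threshold from the ROC curve is Youden's J statistic, i.e. the threshold maximizing TPR - FPR (a sketch on tiny hypothetical scores; the notebook's exact criterion is not shown):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.55, 0.9])  # stand-in predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
# Youden's J statistic: the threshold that maximizes TPR - FPR
optimal = thresholds[np.argmax(tpr - fpr)]
print(optimal)
```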

Check the performance of model on Training Set

Check the different scores of the model

Check the performance of model on Testing Set

Check the different scores of the model

Observations:

The recall of the model for both training and test set has improved but the F1 score has reduced.

Use Precision-Recall curve and see if we can find a better threshold
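A sketch of reading a threshold off the precision-recall curve, choosing the point where precision and recall are closest (hypothetical scores; on the real data the notebook arrives at 0.35):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.55, 0.9])  # stand-in predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# The last precision/recall pair has no threshold, hence the [:-1] slices.
idx = np.argmin(np.abs(precision[:-1] - recall[:-1]))
print(thresholds[idx])
```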

Observations:

At a threshold of 0.35 we get balanced precision and recall

Set the optimum threshold

Check the performance of model on Training Set

Check the different scores of the model

Check the performance of model on Testing Set

Check the different scores of the model

Observations:

F1 Score has improved a little

Model Performance comparison

Compare the outcome of all models on Training Set

Compare the outcome of all models on Testing Set

Observations:

  1. The model built using the Precision-Recall vs. Threshold curve gives us the best F1 score on the testing set.
  2. The model built using the ROC-AUC curve gives us the best recall on both the training and testing sets.

Decision Tree

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting that Customer would be interested in Personal Loan but in reality Customer rejects the offer for the Personal Loan.
  2. Predicting that Customer would not be interested in Personal Loan but in reality Customer accepts the offer for the Personal Loan.

Which case is more important?

Both the cases are important as:

  1. If we predict that a customer would accept the personal loan but the customer rejects the offer, the bank loses the marketing effort and the incidental costs associated with it.

  2. Since a large part of the dataset shows that customers do not accept the personal loan offer, the prediction will also be that most customers would not accept it. However, the bank should not stop marketing the product to such customers on that basis alone: there will be cases where a customer has not availed a personal loan in the past but whose requirements change in the future.

How do we reduce this loss, i.e. reduce false negatives?

Recall should be maximized; the greater the F1 score, the higher the chances of identifying both classes correctly.

Build Decision Tree Model

  1. We will build our model using the DecisionTreeClassifier function, using the default 'gini' criterion to split.
  2. If the frequency of class A is 10% and the frequency of class B is 90%, then class B becomes the dominant class and the decision tree becomes biased toward it.
  3. In this case, we can pass a dictionary {0: 0.15, 1: 0.85} to the model to specify the weight of each class, and the decision tree will give more weight to class 1.
  4. class_weight is a hyperparameter of the decision tree classifier.
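A sketch of the weighted tree on synthetic stand-in data (the class weights mirror the dictionary above; everything else is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 1.2).astype(int)  # imbalanced stand-in target

# class_weight makes the split criterion weight the minority class higher
tree = DecisionTreeClassifier(
    criterion="gini", class_weight={0: 0.15, 1: 0.85}, random_state=1
)
tree.fit(X, y)
print(tree.score(X, y))  # an unrestricted tree fits the training set perfectly
```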

Fit the model on Training Set

Check the performance of model on Training Set

Check the recall score

Observations:

  1. The model is able to perfectly classify all the data points in the training set.
  2. There are 0 errors on the training set; every sample has been classified correctly.
  3. As we know, a decision tree will continue to grow and classify each training point correctly if no restrictions are applied, as the tree learns all the patterns in the training set.
  4. This generally leads to overfitting: the decision tree performs well on the training set but fails to replicate that performance on the test set.

Check the performance of model on Testing Set

Check the recall score

Observations:

  1. We don't see a huge difference between the training and testing sets.
  2. This indicates that the decision tree is not essentially overfitting in our case.
  3. We can do more analysis to verify that the predictions above are correct.

Visualize the Decision Tree

Create a list of columns

Draw the tree

Spool a text report for the rules of Decision Tree

Observations:

Both the visual tree and the text spool are complex in nature and difficult to interpret

Check the important variables used to build the tree

Plot the above data to see the graphical view

Observations:

According to the decision tree, Income is the most important variable in predicting whether a customer will accept a personal loan offer.

Reduce Overfitting

Use GridSearch for HyperParameter tuning

  1. Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in a hyperparameter value will reduce the model's loss, so we usually resort to experimentation, i.e. grid search.
  2. Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
  3. It is an exhaustive search performed over the specified parameter values of a model.
  4. The parameters of the estimator/model are optimized by cross-validated grid search over a parameter grid.
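A sketch of the grid search on synthetic data (the parameter grid is hypothetical; scoring on recall follows the evaluation criterion stated earlier):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in target

# Hypothetical grid; the actual ranges searched are a modelling choice.
param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [5, 10]}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",  # optimize recall, per the evaluation criterion
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)  # best combination found by cross-validation
```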

Check the performance of model on Training Set

Check the recall score

Check the performance of model on Testing Set

Check the recall score

Observations:

  1. The recall score is reduced on the training set, which means the decision tree after hyperparameter tuning is giving generalized results.
  2. The recall score on the testing set has improved.

Visualize the Decision Tree

Spool a text report for the rules of Decision Tree

Observations:

  1. Customers with income less than USD 98,500 and CCAvg spend less than or equal to USD 2,950 are likely to show interest in a personal loan.
  2. Whereas customers with CCAvg spend of more than USD 2,950 are less likely to apply for a personal loan.
  3. Customers with income more than USD 98,500 who are graduates, with CCAvg spend less than USD 4,200, and who are older than 63 years, are likely to show interest in the personal loan product.
  4. Other decision rules can be interpreted similarly.

Check the Important variables used to build the tree.

Plot the above data to see the graphical view

Observations:

  1. Income is the Most Important variable in the tuned Decision Tree as well.
  2. CD_Account has improved its position over Experience in the tuned tree.

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
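The pruning path described above can be computed as follows (stand-in data; the real call runs on the training split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)  # stand-in target

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Alphas are non-decreasing; a larger alpha prunes more nodes
# and raises the total leaf impurity.
print(ccp_alphas[:3], impurities[:3])
```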

Check the Impurities using CCP Alpha for the Decision Tree

Compare the Impurities VS Effective Alphas on Training Set

Train a Decision Tree using the Effective Alphas

Remove the last element from Effective Alphas

Create list of recalls for all effective alphas on Training Set

Create list of recalls for all effective alphas on Testing Set

Get the recall scores on Training and Testing Set for Effective Alphas

Plot recall of Training and Testing Sets

Observations:

  1. The best alpha is around 0.06 from the graph above. We can compare models across the CCP alphas to identify the best one.
  2. We can also build a tree with alpha = 0.023 to check if we get a better recall.

Create a model with highest recall on Training and Testing Sets

Fit the best model on Training Set

Check the performance of the model on Training Set

Check the recall score

Check the performance of the model on Testing Set

Check the recall score

Visualize the Decision Tree

Plot the above information in graphical view

Observations:

  1. The model gives us the best recall.
  2. Income is still the most important variable.
  3. We see CD_Account and Experience no longer appear to be important variables.

Create model with 0.023 CCP_Alpha

Fit the model on Training Set

Check the performance on Training Set

Check the recall score

Check the performance on Testing Set

Check the recall score

Observations:

  1. The results have improved from the initial model.
  2. The performance is comparable to the hyperparameter tuned model.
  3. The model is giving a generalized performance on training and test set.

Visualize the Decision Tree

Spool the Text Report to check the rules of the Decision Tree

Check the Important Variables of the Decision Tree

Plot the above information in graph

Observations:

Income, Education and Family are the three important parameters that influence a customer's willingness to accept a personal loan offer.

Compare all Decision Tree Models

Compare the performance on Training Set

Compare the performance on Testing Set

Observations:

  1. Decision tree model with pre-pruning has given the best recall score on training data.
  2. The pre-pruned and the post-pruned models have reduced overfitting and the model is giving a generalized performance.
  3. We also see that the best model and the model built using CCP_Alpha 0.023 give better recall scores on the training set but the same recall score on the testing set.
  4. We can either use the model built with the identified alpha or the best model identified by comparing recall scores.

Conclusion

  1. We analyzed the dataset using different techniques and used a Logistic Regression classifier and a Decision Tree classifier to build predictive models.
  2. Since, from the business perspective, recall is the best measure for the model, the decision tree has given the best recall on both the tuned and post-pruned trees.
  3. The model built can be used to predict whether a customer will accept the personal loan offer.
  4. We visualized different trees and their confusion matrices to get a better understanding of the models. Easy interpretation is one of the key benefits of decision trees.
  5. We verified how little data preparation decision trees need; such a simple model gave good results even with outliers and imbalanced classes, which shows the robustness of decision trees.
  6. Income, CCAvg, Education, Experience and CD_Account are the important variables in predicting whether a customer will apply for a Personal_Loan.

Recommendations

Personal Loan:

  1. The marketing team can start by promoting the personal loan product to customers with income less than USD 98,500 per annum.
  2. The marketing team can also focus on customers in the higher income segment based on the Education and Age parameters.
  3. The marketing team can also try to identify the usage of personal loans by customers who have already applied for one. For instance, if existing personal loan customers below age 30 are using the loan amount for higher education, home decor or a better lifestyle, this data can be used to target those customers specifically.
  4. Senior customers and customers with higher education are more likely to avail a personal loan. The marketing team can try creating product offerings with a free health check or certain membership benefits linked to new loan applications.
  5. Credit card customers with higher usage on their cards can also be targeted for personal loans.
  6. Campaigns with heavily discounted or waived processing fees, and payment holidays during festive seasons, can be run to attract customers.

Other Products and Services:

  1. During our analysis we noticed that other products, like Credit Card, CD Account, Securities Account and Mortgage Loan Account, also have a potential market, as a larger segment of customers is currently not availing these facilities.
  2. We also noticed that most customers are not using the online banking facility. The marketing team can look at user experience enhancements of the website and promote it so that more customers use it, for example by moving certain branch tasks online (address update, document submission and some tax-related submissions). An enhanced website with better content management can be used to market products like the personal loan by running web/app-specific promotions.